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Abstract 

Background: First pass methods based on BLAST match are commonly used as an initial step to separate the 
different phylogenetic histories of genes in microbial genomes, and target putative horizontal gene transfer (HGT) 
events. This will continue to be necessary given the rapid growth of genomic data and the technical difficulties in 
conducting large-scale explicit phylogenetic analyses. However, these methods often produce misleading results 
due to their inability to resolve indirect phylogenetic links and their vulnerability to stochastic events. 

Results: A new computational method of rapid, exhaustive and genome-wide detection of HGT was developed, 
featuring the systematic analysis of BLAST hit distribution patterns in the context of a priori defined hierarchical 
evolutionary categories. Genes that fall beyond a series of statistically determined thresholds are identified as not 
adhering to the typical vertical history of the organisms in question, but instead having a putative horizontal origin. 
Tests on simulated genomic data suggest that this approach effectively targets atypically distributed genes that are 
highly likely to be HGT-derived, and exhibits robust performance compared to conventional BLAST-based approaches. 
This method was further tested on real genomic datasets, including Rickettsia genomes, and was compared to previous 
studies. Results show consistency with currently employed categories of HGT prediction methods. In-depth analysis of 
both simulated and real genomic data suggests that the method is notably insensitive to stochastic events such as 
gene loss, rate variation and database error, which are common challenges to the current methodology. An automated 
pipeline was created to implement this approach and was made publicly available at: https://github.com/DittmarLab/ 
HGTector. The program is versatile, easily deployed, has a low requirement for computational resources. 

Conclusions: HGTector is an effective tool for initial or standalone large-scale discovery of candidate HGT-derived genes. 
Keywords: Horizontal gene transfer, Bacterial genomes, BLAST 



Background 

Systematic studies have shown that horizontal gene 
transfer (HGT) is prevalent in prokaryotes [1-3], where 
it serves as an important driving force of microbial 
evolution [4]. HGT challenges the detection of vertical 
inheritance patterns in prokaryotes, and the application 
of conventional phylogenetic approaches to infer evolu- 
tionary history of microbial clades has seen increased 
limitations [4-9]. In essence, the ubiquitous nature of this 
process calls for the need to separate the vertical and 
horizontal patterns in evolutionary history of bacterial 



* Correspondence: qiyunzhu@buffalo.edu; kd52@buffalo.edu 

department of Biological Sciences, University at Buffalo, State University of 

New York, 109 Cooke Hall, Buffalo, NY 14260, USA 

Full list of author information is available at the end of the article 

(3 BioMed Central 



genomes. However, this is not straightforward in practice 
and is especially difficult for deep historical events because 
the horizontally acquired genes evolved along with the 
recipient genomes, gradually losing the signatures of their 
original hosts (amelioration) [10]. Furthermore, HGTs 
between closely related organisms, although common, are 
difficult to detect because in these cases donor and re- 
cipient share common compositional and phylogenetic 
features. So far, multiple computational methods have 
been developed to facilitate HGT detection, which may 
be loosely categorized into three main strategies based 
on sequence composition, phylogenetic analysis, or 
best BLAST matches [11-13]. However, there appears to 
be poor agreement between outcomes of diverse methods, 
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and comparative studies have repeatedly demonstrated 
that depending on method different sets of putative 
HGT-derived genes are identified from the same data- 
set, reflecting limitations in the current methodology 
for HGT prediction [13-15]. 

Given the rapid increase in available annotated genome 
data, and the associated computational challenge of 
analyzing such data, the BLAST best match method has 
remained a popular surrogate for first pass discovery 
analyses of gene histories that differ from the strict ver- 
tical pattern [16]. Specifically, this strategy is practiced 
by sorting BLAST hits by measures such as bit scores, 
an indicator of sequence similarity, and the best match 
organism represented by the top hit is identified for 
each gene [17]. If the best match is a distantly related 
organism, instead of one expected by vertical inherit- 
ance, then the gene is categorized as likely horizontally 
acquired [18] (see Additional file 1: Figure S1A, B). In 
practice, researchers often identify the best match using 
the criterion of bidirectional best hits (BBH) [19] to rule 
out potential paralogs. This approach has been applied in 
numerous studies [16,20-24]. Examples of programs fea- 
turing this approach include Pyphy [25], PhyloGenie [26], 
NGIBWS [27], and DarkHorse [28,29], although the latter 
also employs a user- definable filter threshold in combin- 
ation with taxonomic scaling. 

As expected, BLAST-based HGT detection has limita- 
tions. The bit score is based on sequence similarity, and 
provides only a rough estimate of the phylogenetic history 
between organisms. Also, best hits do not necessarily repre- 
sent the nearest neighbors [30]. Most importantly, unex- 
pected best hit can also be caused by reasons other than 
HGT, such as lack of sequence information in related or- 
ganisms [15], gene loss events [31], stochastic similarity 
[15] as well as database error [32]. In practice, the pre- 
dicted HGT candidates are often rejected by downstream 
phylogenetic analyses [33]. Similar to phylogenetic HGT 
detection methods, BLAST best match is effective in de- 
tecting recent HGT events, but shows reduced sensitivity 
for ancient events, when donor and recipient sequences 
have already diverged over the long history of evolution 
[34]. Additionally, merely a best match does not necessar- 
ily provide insights into the direction of gene flow. 

Despite the listed issues, the best BLAST match 
methods make use of all sequence data available in 
GenBank. This sidesteps the need for manual subsampling 
and curation of comparative sequence data that is a 
challenging and time-consuming step in phylogenetic 
methods. This strategy is therefore likely to remain a 
feasible solution to explore microbial evolution in a first 
pass step, utilizing all of the ever increasing genomic 
data [35]. 

Considering this trend, we introduce a BLAST-based 
method to facilitate the detection of horizontal gene 



histories that aims to remedy some of the above outlined 
issues. This approach starts with standard all-against-all 
BLASTP, and is followed by an investigation of the weight 
distribution of all hits grouped by phylogenetically in- 
formed user defined categories (illustrated in Figures 1 
and 2, Additional file 1: Figure SI). A general pattern of 
BLAST hit distributions (a fingerprint) of the genomes of 
interest is computed, and BLAST hit weights of each 
single gene are compared to the general fingerprint. 
This decreases sensitivity to stochastic disturbances. 
Because phylogenetic information is incorporated into 
this process, each resulting distribution is divided by 
uniquely defined cutoffs into typical and atypical gene 
populations. Using a combination of rules, a pool of 
genes that is putatively horizontally derived is reported. 
It is recommended that the atypical gene pool be sub- 
jected to downstream phylogenetic validation of HGT, 
which is implemented in this pipeline. 

With this approach, the method retains all advantages 
of BLAST-based methods, such as efficient and exhaustive 
searches, which also facilitates re-analysis of data following 
the addition of new data. Rather than using general filter 
thresholds, and subsequent refining by taxonomic metrics 
(see LPI in DarkHorse [28]), this method combines the 
two steps into one, and defines unique thresholds for each 
hierarchical level under consideration. 

In order to assess the performance and robustness of 
this pipeline regarding the identification of putatively 
HGT-derived genes, it was applied towards simulated 
genomic data with known HGT events under consider- 
ation of various evolutionary forces. The method was 
also tested on real genomic datasets from multiple 
organismal groups, as exemplified by Rickettsia, whose 
HGT patterns have been previously studied [36-41]. 
The consistency of these results was compared to those 
obtained by other methods, including sequence com- 
position and phylogenetic approaches. Overlaps between 
results were investigated and discussed in the framework of 
technical and biological challenges behind each method. 

Methods 

Method overview 

At the core of this approach is an all-against-all BLASTP 
search of the protein product of each protein-coding 
gene (referred to as gene hereafter) of the genome (s) of 
interest against the genome database. For each protein, 
the BLAST hits are recorded and sorted by their bit 
scores from high to low. The bit scores are then normal- 
ized by dividing them by the bit score of the query se- 
quence in order to account for protein length variance, 
so that every hit has a normalized bit score within the 
0-1 range [21]. The organism corresponding to each hit 
and the taxonomic ranks of each organism are identified 
and recorded from the NCBI taxonomy database. 
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HGTector: a flowchart 
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Figure 1 Flowchart illustrating the procedures of the HGTector method. Parallelograms indicate input data or information, rectangles indicate 
processing steps, diamonds indicate decisions, and rounded rectangles indicate start and end of the work flow. Graphic illustrations of hypothetical 
phylogenetic tree, BLAST hit table and fingerprint are drawn on the right side of their corresponding steps. 



To proceed with the analyses, hits of each gene have 
to be divided into self, close and distal groups. In other 
words, the pipeline doesn't use phylogenetic trees or 
taxonomic lineage information directly, but rather allows 
the user to define three relational hierarchical categories, 



with the biological questions of the research in mind 
(Figure 1, Additional file 1: Figure SI). This approach 
allows flexibility, because it can be scaled to the level 
of taxonomic/phylogenetic interest (e.g. intraspecies, 
intragroup, etc.), and it can be adjusted to frequent 
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Figure 2 Tree topology and fingerprint (distributions of BLAST hit weights) of tests on simulated genomes. One representative test using 
either the idealized tree topology (A) or the randomized tree topology (B) is depicted (see text). The kernel density function of close weight distribution 
for both topologies (C, D) shows the distribution of all genes in the input genomes in black, and that of actual positive genes (derived from HGT events 
from distal group to self group) in red. Genes involved in other simulated evolutionary events are shown in different colors in the lower panels. Locations 
of these genes in the general distribution are indicated as a rug below each plot. The scales of x-axes between the upper and the lower 
panels are identical. Cutoffs computed by the program distinguishing the atypical region from the typical region are represented in dashed 
(for relaxed criterion) and dotted (for conservative criterion) lines. 
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updates in taxonomy. The self group is considered the 
recipient, and always has to include the query genome 
(s), and, depending on analytical scale may also include 
its immediate sister organisms (e.g., different strains 
within the same species, or different species within the 
same genus). The close group will include representatives 
of the putative vertical inheritance history of the group 
(e.g., other species of the same genus, or other genera of 
the same family which the query genome belongs to). 
The distal group includes all other organisms, which 
are considered phylogenetically distant from the query 
genome (e.g., other families, orders, etc.). The method 
will then aim to identify genes that are likely derived 
from directional gene flow from groups of organisms 
within the distal group to members of the self group. 

We introduced a measure to quantify quality of 
BLAST hits of each group. This measure ("weight") is 
calculated per group by summing up the normalized 
BLAST bit scores of hits. Because three categories 
were defined a priori, this step will result in three 
weights per gene. The three weights of all genes of the 
query genome are considered as three independent 
statistical populations. If multiple genomes are analyzed 
together, the weight populations can be merged. The three 
distributions together are defined as a fingerprint of the 
input genome (s). 

A cutoff is set up to divide the weight distribution of 
each group {self close and distal) into typical (larger 
than or equal to cutoff value) and atypical (smaller than 
cutoff value). If most genes of the genome(s) have a 
vertical inheritance history, the typical portion should 
include a majority of the genes, while the atypical por- 
tion should only include a small, but significant subset 
of genes whose hits of this specific group are underrep- 
resented. The cutoff defines the stringency of prediction: 
the higher the weight cutoff, the more genes are consid- 
ered as being atypical. The pipeline implements statis- 
tical approaches to compute the cutoffs, but the user is 
free to implement and use their own statistics. Because 
each genome or set of genomes generates a different 
fingerprint, cutoffs will vary, and are not transferrable 
across analyses. 

For any gene to be predicted as putatively horizontally 
acquired, the following rules apply, which take into account 
all three weight distributions for that gene: 

Rule 1 The gene is below cutoff (atypical) in the close 
weight distribution, suggesting that the 
orthologs of the query gene are absent in all or 
most of the sister groups of the organism of 
interest. This means that the BLAST hits 
belonging to this group are significantly 
underrepresented, in terms of number of hits or 
bit score, or both (Additional file 1: Figure S1B-D). 



This phenomenon can be explained by: (i) the gene 
lacks a vertical inheritance history, or alternatively, 
(ii) the gene was vertically inherited, but 
underwent multiple independent gene loss 
events in the sister groups, a case that is usually 
less likely to be true [11]. In the ideal scenario of a 
high likelihood of HGT, the weight should be 
zero, meaning that there is no close hit (Additional 
file 1: Figure S1B,C). In real datasets however, 
there are sometimes sporadic close hits with low 
bit scores (Additional file 1: Figure SID). These 
hits may be due to stochastic sequence similarity, 
secondary HGTs, paralogous genes or other 
mechanisms. This situation typically causes false 
negatives in conventional best BLAST match 
methods [11]. However, in the present method, 
these sporadic hits will not significantly alter the 
overall weight of a gene, thus hardly affecting the 
prediction results. 

On the other hand, for a vertically transmitted 
gene, its orthologs may not always be present in 
each and every sister lineage. Occasional gene 
loss events may take away some of the expected 
number of close hits (Additional file 1: Figure 
S1E). This situation is a major source of false 
positives in best BLAST match methods [31]. 
However, this problem is effectively overcome in 
the present method, because sporadic absences 
of hits do not make the overall weight atypical. 
Moreover, one or a few high-score distal hits 
caused by natural (outgoing HGT from the self 
group to the distal group) or artificial reasons 
(contamination, mislabeling, etc.) (Additional 
file 1: Figure S1F) can easily deceive conventional 
methods [32]. However, the present method is 
immune to this problem, as the close weight 
remains unchanged in this situation. 
Rule 2 The gene is equal to or above cutoff (typical) in the 
distal weight distribution, meaning that the hits 
from distant organisms are not underrepresented 
(Additional file 1: Figure S1A-F,H). This criterion 
sets up a filter at the donor side of potential HGT 
events: given the gene was transferred from a 
representative or ancestor of organisms that 
belong to the distal group, BLAST hits of the 
distal group itself should not be underrepresented. 
Otherwise, it is not convincing to conclude that 
the gene is HGT-derived. 
The goal of this rule is stringency, in order to 
better distinguish putative HGT events from 
other scenarios that can also make a genes close 
weight atypical: (1) de novo gene origination 
within the self group, (2) inaccurate genome 
annotation that considers a non-coding region a 
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gene, or (3) HGT from an unsequenced organism 
that is not phylogenetically close to any sequenced 
groups of organism. These scenarios usually result 
in few to no distal hits (Additional file 1: Figure 
S1G, upper panel). Meanwhile, (4) sequence 
similarity due to randomness instead of homology 
may also bring in some distal hits with low bit 
scores (Additional file 1: Figure S1G, lower panel). 
Rule 3 (optional) The gene is below cutoff (atypical) 
in the self weight distribution. It means that a 
gene is only sporadically detected, rather than 
being prevalent across the whole self group of 
organisms (Additional file 1: Figure SIB, H). 
This option will restrict the prediction to a 
subset of putatively HGT-derived genes that 
were acquired by specific lineages of the self 
group, instead of the whole self group. Often, 
these could be recent transfer events. 

To assess the source of a putative HGT event, the best 
match organism from the distal group is reported. It is 
important to understand that this best match is likely 
not the actual physical donor, but may be an extant rep- 
resentative of an ancestral, extinct donor. We recom- 
mend using "donor link" to describe the directionality of 
transfer, and relationship between these organisms, in- 
stead of "donor" [37]. 

A flowchart of the procedures of this method is illus- 
trated in Figure 1. 

Computational pipeline 
General procedure 

The HGTector pipeline (publically available at: https:// 
github.com/DittmarLab/HGTector) is written in Perl, and 
is cross-platform supported, running in Windows, Mac 
OS and Linux systems. It requires only the Perl interpreter 
with its core modules, a default component of most Mac 
OS and Linux distributions and is very easy to install in 
Windows systems. In the default mode, the program 
depends on no additional software or local databases to 
run. This characteristic maximizes the ease of instal- 
lation for users without professional computer back- 
ground and resources. Optionally, it calls R [42] to 
perform advanced statistical computing and graph- 
ing. Parameters of the program are managed by a 
central configuration file, which can be created and 
edited manually or via a graphical user interface 
(GUI). The program is composed of the following 
procedures: 

First, the program performs batch BLASTP of protein 
products of multiple genes supplied by the user. Multiple 
formats ranging from simple lists of NCBI accession 
numbers to annotated genomes in GenBank format are 
supported. It runs BLASTP either via web connection 



to the NCBI server, or with a standalone BLAST program 
and a local database. Faster alternatives to the original 
BLASTP may also be used, as long as their output files 
are in a compatible format. It also harvests taxonomic in- 
formation of each hit for each gene from the NCBI tax- 
onomy database. Hierarchical taxonomic reports (NCBI 
TaxBlast) and sequences of hits (original or aligned) can 
be retrieved optionally. BLAST results are saved in 
NEXUS format [43], which can be directly viewed by 
text editors, or opened as multiple sequence alignments 
by external programs such as SeaView [44], for add- 
itional analyses. This characteristic facilitates down- 
stream analyses, and compatibility with other programs. 

BLAST hits are filtered by multiple optional functions 
to overcome putative taxon sampling bias that may 
affect BLAST hit distribution: (1) When multiple hits 
are present for one organism (e.g., dozens of copies of a 
mobile element), only the best hit is maintained, repre- 
senting the putative ortholog of the query protein [45]. 

(2) When multiple genomes are present for one species 
(e.g., hundreds of sequenced Escherichia coli strains), 
only the genome that contains the best hit is maintained. 

(3) Taxonomic name keywords or IDs that represent un- 
wanted BLAST targets, such as inaccurately defined 
taxonomic ranks (e.g., genus Clostridium) and biological 
categories (e.g., "environmental samples"), can be speci- 
fied so that these hits will be omitted. 

By default, the program will exclude genes without any 
non-self hits from subsequent analyses, because they 
may represent ORFans [46], resulting from de novo gene 
origination events (which are very rare [47]), or transfer 
events from unknown sources that are very dissimilar 
from any sequenced genomes. Alternatively, they may 
represent genome annotation errors, which have been 
long recognized as a common and perturbing issue 
[48-50]. While these genes are not considered in the 
subsequent analysis, the genes are reported as "putative 
ORFans or annotation errors", or POE, in this pipeline 
(Additional file 1: Figure S1G, upper panel). This allows 
the user to check which POEs were omitted, and if neces- 
sary make further adjustments to the analytical set up. 

Based on the retrieved taxonomic information, the pro- 
gram can automatically formulate a grouping scenario, in 
which the lowest rank that includes all input genomes is 
defined as the self group, and the next higher rank as the 
close group. Because taxonomic classifications in GenBank 
may not always reflect the most current natural groupings 
of organisms, users may manually define hierarchies based 
on current knowledge of phylogeny and the purpose of 
their research. 

With a properly defined grouping scenario, the pro- 
gram then calculates the three weights of each gene, 
computes a fingerprint of the whole genome (s), defines 
proper cutoffs and determines the population of atypical 



Zhu et al. BMC Genomics 2014, 15:717 
http://www.biomedcentral.com/1471 -21 64/1 5/71 7 



Page 7 of 18 



events, and possible HGTs based on the selected rules. 
Basic statistical parameters of the three distributions of 
weights, as well as the weight populations themselves 
are reported. The fingerprint may be visualized by 
box plots, histograms, density plots and scatter plots 
(Figures 2 and 3). Statistical analyses and graphing of 
BLAST hit distributions are automated in the program. 
They are performed using Perl codes, or by sending 
commands to R [42]. The communication between Perl 



and R is utilized by the Perl module "Statistics::R". While 
multiple statistical approaches are available for the users 
choice, the typical procedures, which are used in the tests 
described in this article, are as follows: 

Cutoff determination 

The program performs kernel density estimation [51] to 
obtain a function of probability density distribution of 
the close weight for all genes. By default it uses Gaussian 
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Figure 3 Fingerprint of seven Rickettsia genomes. BLAST hit weights of all protein-coding genes in the seven Rickettsia genomes are plotted. 
(A-C) Kernel density functions of the self, close and distal weights. The x-axis represents the weight of each gene. The y-axis represents the probability 
density of genes with the corresponding weight in the genomes. In this example, rule 1 {close weight < cutoff) and rule 2 {distal weight > =cutoff) were 
applied. The close and distal cutoffs computed under the conservative criterion are indicated by dashed lines. The values of the cutoffs are denoted in 
each panel. (D) A scatter plot of the distal weight against the close weight, showing the clustering pattern of the genes. Each dot represents one gene. 
Genes predicted to be HGT-derived are framed by a red rectangle. (E) A zoom-in view of the left part of the previous plot. Genes that fall within the 
atypical region in the close weight distribution are colored by a blue-red color scheme based on the density-based silhouette (dbs), a measure 
of confidence that this gene belongs to the atypical cluster of genes (red = high confidence). The close cutoff used in the subsequent analyses 
is indicated by a dashed line. 
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kernel smoothing with Silverman's rule-of-thumb band- 
width selector [52]. The user is allowed to choose a proper 
bandwidth selection factor that controls the smoothness 
of the curve. The function is plotted and made visible 
to the user in real time. Statistically significant local 
minima (pits) and maxima (peaks) are computed using 
the "pastecs" package [53] with default parameters 
following Kendalls information theory [54], and their 
x-coordinates are recorded and displayed to the user. 
The program then automatically identifies a local mini- 
mum separating the typical from the atypical proportion 
of the gene population under consideration. Specifically, 
the biggest peak or group of continuous peaks in terms 
of number of genes it covers is identified as the typical 
region, and the rest is defined as the atypical region 
(Figures 2 and 3). 

In addition to the apparent typical peak, there is usually a 
clearly identifiable peak located close to zero (referred to as 
the "zero peak" hereafter). This peak usually includes "ideal" 
putatively HGT-derived genes (=zero BLAST hit). 
Between the zero peak and the typical peak is a transitional 
zone that likely consists of genes with ambiguous evolu- 
tionary history (see Results and Discussion). 

The program automatically reports two cutoffs: the 
x-coordinate of the identified local minimum is naturally 
chosen as a cutoff (referred to as the "relaxed cutoff 
hereafter). However, based on results of repeated tests, 
we recommend using the second cutoff, which is de- 
fined as the arithmetic mean of the x-coordinates of the 
local maximum of the zero peak and the local minimum 
of the selected pit (referred to as the "conservative cut- 
off hereafter) (Figure 2C, D). The choice between the 
two cutoffs depends on the goal of research, but for the 
identification of putative HGT events, the conservative 
cutoff is preferred as it meets a balance between preci- 
sion and recall (see Results and Discussion). 

The program also implements several functions to assess 
the statistical significance of separating atypical genes from 
typical ones. For the whole weight population, the pro- 
gram performs Hartigans dip test [55] to assess the non- 
unimodality of the weight distribution, which essentially 
is the statistical significance that a distribution can be 
divided into two or more distinct parts. The test is per- 
formed by calling the "diptest" package in R [56]. The 
dip statistic and the p-value for the test for unimodality 
are reported. For each individual gene, the program 
computes a density-based silhouette (dbs) (using the 
"pdfCluster" package in R), which is a statistical measure 
of confidence that a certain data point (gene) is allocated 
to a cluster (here the atypical region) [57]. 

The same procedures apply to the distribution of the 
distal weight (and the self weight, if the optional rule 3 is 
chosen). In addition to the above described statistical ap- 
proaches, power users may also perform extra statistical 



analyses based on the weight data output by the program, 
and type user-defined cutoffs. 

Based on the cutoffs, the program reports a population 
of genes with an atypical, non-vertical history, which in the 
context of the a priori provided phylogenetic information 
represents a putative horizontal history. The results are 
summarized in a choice of plain text, web page (HTML) or 
Excel spreadsheet formats. The latter allows for convenient 
downstream statistics of outputs, and includes hyperlinks 
that allow users to track each of the genes back to their 
original BLAST report. It not only reports the number 
and percentage of putatively HGT-derived genes, but 
also optionally categorizes each gene in three contexts: 
(i) By putative donor group, which is described by user- 
designated higher taxonomic rank of the best match 
organism (based on GenBank annotations), (ii) By func- 
tional annotation of protein products, which is provided 
by external sources, such as the output of Blast2GO [58]. 
(iii) By gene orthology (evolutionary history of each in- 
dividual gene family across input genomes), which is 
identified by a built-in function of BLAST hits clustering or 
from external sources, such as the output of OrthoMCL 
[59]. These reports (for example see Additional file 1: 
Figure S1G) allow users an intuitive view of the preva- 
lence of HGT-derived genes and the evolutionary, eco- 
logical and functional implications of HGTs at levels 
of individual genes, gene families, whole genomes and 
multiple phylogenetically related genomes. 

The whole analysis can be performed on a regular per- 
sonal computer, as it does not require extensive comput- 
ing power. Batch-BLAST is the most time-consuming step, 
which typically lasts several hours to several days, de- 
pending on the number of protein-coding genes in the 
genome(s) of interest. The computation time can be re- 
duced if faster BLAST alternatives or smaller databases 
are used. The statistical analysis typically takes only minutes. 
Data generated by the program can be parsed and re-used 
by other programs for multiple purposes. 

As an additional, and important function, the program 
provides a complete phylogenetic pipeline, which auto- 
mates the process of multiple sequence alignment, align- 
ment trimming and phylogenetic tree reconstruction, by 
calling external local programs such as ClustalW [60], 
Gblocks [61] and RAxML [62], and parsing their outputs. 
Reconstructed phylogenetic trees are annotated with organ- 
ismal names and are attached to BLAST reports, which in 
turn can be directly viewed by external programs such as 
FigTree [63]. This function allows users to monitor and val- 
idate prediction results by manually checking the evolution- 
ary scenarios of individual genes. 

Analysis of simulated genomic data 

To assess the performance of this method under the impact 
of various evolutionary scenarios, as well as to compare 
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HGTector to conventional BLAST methods, we tested the 
above-described pipeline on simulated genomic data. 

Simulation of genome evolution 

Simulated genomic data were generated by ALF (Artificial 
Life Framework) version 1.0, a program that simulates 
genome evolution [64]. In each simulation, 100 species 
evolved from one randomly generated root genome con- 
taining 1000 protein-coding genes that are no shorter than 
50 aa. During the process random inter-genomic HGT 
events occurred under a pre-defined global rate, which 
varied between simulations (see below). The process also 
simulated the following evolutionary forces in addition to 
HGT, at random rates: speciation, character substitution, 
insertion and deletion, GC-content amelioration, rate 
variation among sites and among genes, gene duplica- 
tion and loss. Many of these forces are known to affect 
HGT prediction [15,31,32]. 

Two types of simulations were performed. First, an 
idealized, pre-defined tree topology was used, in which 
all representatives of close- and se//-designation are grouped 
in a polytomy, signifying (in this case) equal genetic dis- 
tance, thus eliminating the impact from evolutionary arti- 
facts. The rest of the taxa are placed relatively distant 
(unrelated) in the tree to lower the possibility of stochastic 
BLAST matches. The topology is detailed as follows: 10 
clades branch from the tree base simultaneously at time 
point 0 (unit: PAM distance, same below). In each clade, 
10 species branch off simultaneously at time point 90. The 
tips are at time point 100 for all species. For each clade, 
one species was randomly chosen as the input genome 
(also the self group), while nine species were defined as the 
close group. All 90 species in other clades are considered as 
the distal group (Figure 2A). All ten clades were analyzed 
in this manner. This simulation was replicated 10 times 
(=100 analyses), with the HGT rate ranging from 0 
(negative control) to 0.0045 with an interval of 0.0005. 

Second, a randomized birth-death tree was generated 
in ALF per simulation (birth rate = 0.1, death rate = 0.01, 
height = 1000), mimicking a more realistic topological 
situation. A random clade was manually chosen from the 
tree, as long as it met the following criteria: (1) 3-8 self 
species that formed a clear monophyletic group; (2) 10-20 
close species that were closely related to the self clade; (3) 
the self and close species together formed a clear mono- 
phyletic group that was independent from all other species 
(the distal group) (Figure 2B). This simulation was repli- 
cated for 100 times, with the HGT rate of each replicate 
randomly sampled from a range of 0 to 0.005. 

Evolutionary events (HGT, gene duplication, gene 
loss, etc.) simulated in each analysis were extracted 
from the ALF log file. The time, species and genes 
involved in these events were recorded. HGT events 
involving the self group were further categorized by 



their donor, recipient and time into the following 
groups: target HGT (HGT from the distal group to the self 
group, which are the actual positives to be targeted by 
HGTector), ancient HGT (HGT from the distal group to 
the ancestor of the self group), outgoing HGT (HGT from 
the self group to the distal group, which is equivalent in 
consequence to a database error that mislabels a sequence 
with another organism), and secondary HGT (target HGT 
combined with one or more additional transfers between 
the self and the close groups). Gene loss and gene duplica- 
tion events taking place in the close group were singled 
out as they directly influence the distribution of the close 
weight. 

For each simulation (idealized or randomized tree 
topologies), a BLASTP database including the protein 
sequences from all 100 genomes was created using the 
standalone BLAST program [17]. All-against-all BLASTP 
was performed with an E-value cutoff at 1 x 10~ 5 for all 
genes in the selected self genomes, considering at least 20 
hits. In order to demonstrate the effect of gene duplication 
on the prediction result, the program option of excluding 
paralogs was turned off. 

A modified version of the HGTector pipeline was created 
to parse the ALF outputs and to perform analysis. Both the 
conservative and the relaxed cutoff criteria were tested. The 
fingerprint was plotted along with the distribution of actual 
positives (target HGTs) as well as other evolutionary 
scenarios (see above) for manual observation in addition 
to statistical analysis. 

Comparison to conventional BLAST approach 

The performance of HGTector in the context of identifying 
atypical and putatively HGT-derived genes, was compared 
against that of commonly used conventional BLAST-based 
methods by modifying the pipeline to mimic the con- 
ventional approach, which does not consider overall hit 
distribution. Specifically, we considered HGT-e vents for 
the scenario where no close hits are present (criterion: 
C = 0); or the scenario where there is at least one distal 
hit with a bit score higher than those of any close hits 
(criterion: D > C). 

Assessing performance 

The predicting power in the context of target HGTs 
under each criterion, as well as for conventional BLAST 
was assessed by precision and recall. Precision describes the 
number of true positives over the number of all predicted 
positives (=How many of the predicted HGTs are real), 
and recall describes the number of true positives over 
number of all actual positives (=How many of all real 
HGTs were found), which are commonly used perform- 
ance markers for binary classifications. Statistical analysis 
and plotting were conducted using R [42]. 
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Application of HGTector to real datasets 

HGTector was further tested on a variety of real-world 
genomic data, covering bacteria (5), unicellular eukaryotes 
(1) and human (1) (Table 1), the last of which serves as a 
realistic negative control since HGTs into genomes of 
higher animals are known to be very rare [5]. The most in 
depth analysis was conducted on the Rickettsia dataset, 
because it affords comparisons to previous results obtained 
with a variety of methods [29,36,37,39,65,66]. 

Analysis on the Rickettsia dataset 

Out of all available Rickettsia genomes, we selected seven 
representative Rickettsia species with fully-annotated ge- 
nomes for this analysis (Table 1). All of these species belong 
to the spotted fever group (SFG), a traditional classification 
group of Rickettsia. A grouping scenario was chosen based 
on the taxonomy and phylogeny of major Rickettsia species, 
which has been well resolved by recent studies [37,41,67]. 
Specifically, we defined the self group as SFG (NCBI 
taxonomy ID: 114277), the close group as order Rickettsiales 
(766), excluding SFG, and the distal group as all non- 
Rickettsiales organisms. 

All-against-all BLASTP was performed against the NCBI 
non-redundant protein sequences (nr) database with an E- 
value cutoff at 1 x 10" 5 . A soft filtering for low sequence 
complexity regions, which was suggested as the optimal 
parameter for ortholog identification [68], was used. Hits 
with organism names containing these words were ex- 
cluded for the purpose of this analysis: unknown, uncul- 
tured, unidentified, unclassified, environmental, plasmid, 
vector, synthetic, and phage. Up to one hit from each or- 
ganism was preserved. A maximum number of 200 hits 
were preserved for each protein. A global fingerprint was 
computed and graphed to describe the pattern of BLAST 
hit distribution of all seven genomes. Cutoffs for the three 
groups of weights were computed using the built-in kernel 
density estimation function under the conservative criter- 
ion. The default rules 1 and 2 were applied. 

Assessing stochastic events using real datasets 

To test the impact of stochastic events on prediction 
results, the following simulations were carried out on the 
Rickettsia dataset: (1) database error (some sequences are 
mislabeled by incorrect organism names), (2) gene loss 
in the close group, (3) rate variation in the close group, 
(4) rate variation in the input genomes, (5) taxon sampling 
bias (some groups of organisms are overrepresented in the 
genome database), and (6) smaller sample size of genes. 
Multiple degrees of modification intensity (x) for each type 
of events were tested, each having 100 replicates (except for 
taxon sampling bias). Specifically: 

1. Database error. For each hit, there is x proportion of 
chance that its corresponding organism was 



assigned to an organism randomly sampled from the 
pool of BLAST results of all genes. 

2. Gene loss in the close group. For each close hit, 
there is x proportion of chance of being removed 
from the BLAST hit table. 

3. Rate variation in the close group. For each close hit, 
there is x proportion of chance that its bit score is 
divided by a factor subject to a gamma distribution 
with shape parameter k = 2 and scale parameter 8=1: 

Si=S 0 /(l + r(2,l)) 

Where S 0 and Si refer to the bit score before and 
after manipulation, respectively. 

4. Rate variation in the input genomes. For each query 
gene, there is x proportion of chance that the bit 
scores of all its hits are divided by a factor subject to 
a gamma distribution same as above. 

5. Taxon sampling bias. For selected representative 
groups of organisms from close and distal groups 
(see Results and Discussion), all hits belonging to 
this group were replicated into x copies (x identical 
but separate hits). 

6. Smaller sample size of genes, x out of all 8484 genes 
were randomly selected for fingerprint calculation in 
each replicate. The resulting cutoffs were in turn 
used for prediction on all genes. Replicates that 
failed to pass the Hartigans' dip test (that is, the 
population of either close or distal weights is not 
significantly non-unimodal) (p-value threshold = 0.05) 
were counted and excluded from the analysis. 

HGTector analysis was conducted on these replicates 
using the same procedures as in the standard analysis 
on the unmodified dataset (see above). The results 
were compared to the results derived from a conven- 
tional BLAST-based approach under the D > C criterion 
(see Analysis of simulated genomic data). Precision and 
recall of the results were computed using the result of the 
standard analysis as the reference. 

Cross-method comparison 

In order to evaluate the performance of HGTector on 
the Rickettsia dataset in the context of other available 
methods, results were compared to two examples from 
each of the three currently employed strategies: BBH 
(bidirectional best hit) [19] and DarkHorse [28,29] based 
on best BLAST match; GIST [69] and IslandViewer [66] 
based on sequence composition, and two studies conducted 
by Merhej et al. [37] and Le et al. [36], using phylogenetic 
approaches. We exemplified this comparison on the R. felis 
genome, which previously has been demonstrated to have 
high HGT frequency [37,39,65]. 



Table 1 Real genomic datasets tested in this study 



Category Self group Close group No. of genes Date of BLAST Max. no. of hits List of input genomes (organism name and 

NCBI accn. no.) 



Alphaproteobacteria 


SFG Rickettsia 


Rickettsiales 


8484 


Jan. 2013 


200 


R. akari str. Hartford [GenBank:NC_009881] 

R. fells URRWXCal2 [GenBank:NC_007109] 

R. massiliae MTU 5 [GenBank:NC_009900] 

R. slovaca 13-B [GenBank:CP002428] 

R. rickettsii str. 'Sheila Smith' [GenBank:NC_009882] 

R. africae ESF-5 [GenBank:NC_012633] 

R. conorii str. Malish 7 [GenBank:NC_003103] 1 


Firmicutes 


Streptococcus 


Bacilli 


11906 


Nov. 2013 


100 


S. onginosus C238 [GenBank:NC_022239] 
S. gollolyticus UCN34 [GenBank:NC_013798] 
S. intermedius B196 [GenBank:NC_022246] 
5. mutons LJ23 [GenBank:NC_01 7768] 
S. pneumonia A026 [GenBank:NC_022655] 
S. suis JS14 [GenBank:NC_017618] 


Epsi 1 on proteobacteria 


Helicobacter 


Campylobacterales 


10531 


Mar. 2013, Nov. 201 3 2 


200 


H. ocinonychis Sheeba [GenBank:NC_008229] 
H. bizzozeronii CI 1 1-1 [GenBank:NC_015674] 
H. cinaedi PAGU61 1 [GenBank:NC_01 7761] 
H. fells ATCC 49179 [Gen Bank:NC_0 14810] 
H. musteloe 12198 [GenBank:NC_013949] 
H. hepaticus ATCC 51449 [GenBank:NC_004917] 


Gammaproteobacteria 


Erwinia 


Enterobacteriales 


19013 


Mar. 2013 


200 


E amylovora ATCC 49946 [GenBank:NC_013971] 
E. billingiae Eb661 [GenBank:NC_014306] 
E. sp. Ejp617 [GenBank:NC_017445] 
E. pyrifoliae DSM 12163 [GenBank:NC_01 7390] 
E. tasmaniensis Et1/99 [GenBank:NC_010694] 


Actinobacteria 


Mycobacterium africanum 


Mycobacterium 


3830 


Oct. 2013 


100 


M. africanum GM041 182 [GenBank:NC_015758] 


Unicellular red algae 


Galdieria sulphuraria 


Eukaryota 


7174 


Dec. 2013 


50 


G. sulphuraria [GenBank:ASM34128v1] 3 


Higher animal 


Homo sapiens 


Animalia 


22516 4 


Nov. 2013 


1000 


H. sapiens [GenBank:GCF_000001405.13] 



^he genomes used in this study are identical to those used in [37]. 

2 Two independent analyses were conducted on different dates, and similar outcomes were obtained. The more recent result was reported. 
3 The genome used in this study is identical to that used in [79]. 

4 For genes with multiple isoforms, the longest CDS was extracted using an in-house Perl script and used for the analysis. 
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Specifically, BBH analysis was performed as a built-in 
function of the present pipeline. This method is similar 
to the D > C criterion as described above, except for an 
additional reverse BLAST step with the same parameters 
to confirm that the two sequences are each others 
best match within their host genomes. The result by 
DarkHorse was downloaded from the DarkHorse server 
(darkhorse.ucsd.edu). Default parameters were used, in 
which the BLASTP E-value cutoff is also 1 x 10" 5 . All 
three available phylogenetic granularities, strain, spe- 
cies and genus, were used and the results were merged, 
in order to maximize the discovery rate. Both GIST and 
IslandViewer are targeting large pieces of heterogeneous 
genomic regions, or genomic islands (GI) [70]. GIST is a 
synchronization of five subprograms: AlienHunter [71], 
IslandPath [72], SIGI-HMM [73], INDeGenlUS [74] and 
PAI-IDA [75]. The subprograms were executed in a local 
system and the results were processed using the EG ID al- 
gorithm [76] to get a consensus result. IslandViewer is an 
integration of three subprograms: IslandPick, SIGI-HMM 
and IslandPath. The integrated result was downloaded 
from the IslandViewer server (www.pathogenomics.sfu.ca/ 
islandviewer). Results of GIST and IslandViewer were 
further processed by an in-house Perl script to extract 
the genes included in the genomic islands. The puta- 
tive HGT-derived genes predicted by Merhej et al. [37] 
and by Le et al. [36] were extracted from the original 
publications. Specifically, Merhej et al. identified 152 
HGT-derived genes in the R. felis genome that are 
linked with organisms other than SFG Rickettsia [37]. 
Le et al. identified 11 instances of HGT from outside 
Rickettsiales into the R. felis genome [36]. 

The predicted HGT-derived genes or genomic islands 
by different methods were spatially mapped to the R. felis 
genome and visualized in Geneious 6.0 [77]. An "overlap 
factor" (OF) was employed as a criterion to compare the 
outcomes of different methods by assessing the overlap. 
This was expressed as the negative logarithm of the likeli- 
hood that the overlap was obtained by chance. To com- 
pute an OF, the number of the same genes predicted by 
each method pair was counted, and the OF was calculated 
following the probability mass function of the hypergeo- 
metric distribution: 




Where n is the total number of genes; a and b are the 
numbers of genes predicted by two methods, respect- 
ively; k is the number of genes overlapping by two sets 
of results. The larger an OF is, the more overlapping, 
and thus more consistent the two sets of results are, and 
the more likely it is that the two methods are identifying 
the same group of genes. 



Results and discussion 

Performance on simulated genomic data 
Testing precision and recall 

In all experimental groups under the idealized tree 
topology, a clear bimodal distribution was observed 
(Figure 2C), which is expected when HGT is present in the 
data. Meanwhile, none of the negative control groups have 
an identifiable zero peak, which is equivalent to a vertical 
history for all genes (no HGT events). Both cutoff criteria 
achieve high precision and recall simultaneously. In 
particular, under the conservative criterion, 99.4% of 
the prediction results are true positives. Meanwhile, 
they cover over 91.3% of all actual HGT-derived genes. 
The more relaxed criterion still achieves a precision of 
95.3% and a recall of 96.8%. 

In tests under randomized tree topologies, a larger 
transitional zone is present between peaks of the expected 
bimodal distribution, which is also frequently observed in 
real datasets. Both, precision and recall are affected com- 
pared to the idealized scenario, but in different measures. 
The conservative cutoff still maintains reasonably high 
precision (81.6%) and recall (90.5%) simultaneously, relative 
to the known number of HGT events. The relaxed criterion 
keeps equally high recall (96.6%) but its precision drops 
significantly (42.6%) (Figure 4, Additional file 1: Figure S2). 

Because the test under randomized topologies is a 
more realistic representation of datasets, its result serve 
as a better reference for the practical consideration of 
the HGTector application. Given the simultaneously 
high precision and recall, we recommend using the con- 
servative cutoff for both initial HGT candidate screening 
(to be followed by phylogenetic analysis or other analyses) 
and standalone HGT discovery (when further in-depth 
analyses are not applicable). The relaxed cutoff may be 
considered only when the user wants to maximize dis- 
covery rate in an initial screening, in spite of its higher 
false positive rate. The comparison between idealized 
and randomized topologies further indicates the positive 
correlation between prediction success and a properly 
defined grouping scenario, in which: (1) the self-close 
clade is relatively distant from any other organisms; 
and (2) there are multiple subclades in the close group, 
each having similar number of taxa represented in the 
database. 

Varying global HGT rates seemingly show little effect 
on the stability of precision and recall of this method 
(Additional file 1: Figure S3), suggesting that the predicting 
power is independent of HGT rate. 

In comparison, the performance of both conventional 
BLAST-based approaches (C = 0 and D > C) is notably 
unbalanced. Atypical genes falling under the C = 0 cri- 
terion have the highest precision (100.0% and 97.1%, 
for idealized and randomized tree topology, respect- 
ively), but very low recall (78.1% and 73.7%). The D > C 
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Mean: 1.000 0.781 0.994 0.913 0.953 0.968 0.393 0.997 
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Criterion of HGT prediction Criterion of HGT prediction 

Figure 4 Comparison of performance of HGTector and conventional BLAST-based method on simulated genomes. The methods were 
tested on the simulated genomic data under an idealized (A) or randomized (B) tree topology. "Con" and "Rel" represent conservative and 
relaxed cutoffs in the HGTector analysis. "C = 0" and "D > C" are two criteria in the conventional BLAST-based method. Each experimental group is 
composed of 100 tests. The distribution of results in terms of precision (red) and recall (blue) is depicted by box plots. The mean value of each 
group is label above the corresponding column. 



criterion has high recall (99.7% and 98.7%) but ex- 
tremely low precision (39.3% and 33.1%), showing an 
intolerably high false positive rate (Figure 4, Additional 
file 1: Figure S2). From a practical perspective, C = 0 is 
too stringent, thus omitting a big portion of true HGT- 
derived genes affected by stochastic events, while D > C is 
too relaxed and not capable of differentiating genes that 
have high-score distal hits merely due to stochastic rea- 
sons instead of HGT (see below). It should be mentioned 
that C = 0 is not applicable to real datasets due to frequent 
genome annotation errors and ORFans, both of which 
may have a zero close weight. 

Evaluating other evolutionary scenarios 

The impact of other evolutionary events on the fingerprint 
and the division between typical and atypical gene popula- 
tions was explored. Outgoing HGT (from self to distal) 
seemingly does not significantly alter the close weight of 
a gene. Gene loss decreases close weight and gene dupli- 
cation increases it, both within an insignificant range 
(Figure 2C, D). Most importantly, genes within these 
three categories of evolutionary history still fall within 
the typical region and were not mistakenly detected as 
atypical by HGTector. 

It is particularly notable that the majority of genes derived 
from ancient HGT events are located in the transitional 
zone (Figure 2D). Expectedly, they constitute a considerable 



portion of false positives in our analyses. In other words, 
depending on cutoff, more of them are likely to be 
placed in the typical population, instead of the atypical, 
and presumably non-vertical population. A similar pattern 
was observed for secondary HGT events. Although not 
frequent in the simulated genomes here, these events are 
actually very frequent in real datasets, as HGT frequency 
is higher between closely related organisms [78] . However, 
in the conventional BLAST method (D > C) most false 
positives are composed of mainly outgoing HGTs but only 
a few ancient HGTs. 

Application to real datasets 

The close weight distributions for real datasets exhibit a 
bimodal distribution containing a broad typical region 
and a sharp atypical peak that is located close to zero 
(as expected) (Figure 3A-C, Additional file 1: Figure S4). 
Therefore, the cutoffs that divide genes with atypical 
BLAST hit distribution from typical ones can be set accord- 
ingly and HGT events can be predicted based on the 
cutoffs. As an example, fingerprint plots for Rickettsia and 
Galdieria exhibit an apparent overlapping pattern between 
the atypical peak and the HGT-derived genes identified 
by previous phylogenetic studies [37,79] (Additional file 1: 
Figures S4E and S5D). In contrast, the human genome does 
not have an apparent atypical peak, which is also expected 
(Additional file 1: Figure S4F). 
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Analysis of the Rickettsia genomes 

The global fingerprint describing the pattern of BLAST 
hit distribution of the Rickettsia analysis is illustrated in 
Figure 3A-C Results from Hartigans' dip test strongly 
support the non-unimodality of all three weight distri- 
butions (p-values < 2.2 x 10" 16 ). The seven Rickettsia 
genomes contain a total number of 8484 annotated 
chromosomal protein-coding genes, of which 800 genes 
have an atypical close weight and a typical distal weight, 
and thus are potentially HGT-derived (Table 2, Additional 
file 2: Table SI). The percentage of putative HGT-derived 
genes in a genome ranges from 6.05% (76 genes) in R. akari 
to 18.29% (256 genes) in R.felis. The number of putative 
HGT-derived genes per genome is positively correlated to 
the size of the genome (R 2 = 0.755), implying contribu- 
tion of HGT to the relative genome expansions in the 
overall reductive trend of Rickettsia genome evolution, 
confirming previous studies [65]. It is notable that the 
R. felis genome contains significantly more putative 
HGT-derived genes (256) than the rest of the genomes 
(90.7 ± 20.5, mean and standard deviation), suggesting 
a particular prevalence of potential HGT events in R. felis 
evolution. The comparison between the fingerprint on 
R. felis genome alone (Additional file 1: Figure S5A-C) and 
the global fingerprint (Figure 3A-C) clearly reveals that 
R. felis has a much larger atypical peak in the close 
weight distribution. This reinforces outcomes from 
Merhej et al. [37], which found that the frequency of 
cross-species bacterial recombination in R. felis had 
been underestimated. 

To further explore the biological information behind 
the predicted patterns, the results have been summarized 
in three separate contexts: by putative donor group, by 
functional annotation and by gene orthology. Similar to 
previous in depth analyses by Merhej et al. [37], our ana- 
lyses revealed frequent donor links, such as Legionellales, 
Enterobacteriales and Burkholderiales, and frequently 
transferred gene categories, such as genes encoding trans- 
posases and genes involved in phage and plasmid activity 



(Additional file 1: Figures S6 and S7; Additional file 2: 
Table S2). 

Stochastic manipulation of Rickettsia data: stability 
of prediction 

The results of these simulations (Additional file 1: Figures 
S8 and S9) suggest that HGTectors performance is 
notably insensitive to database error. Even under an 
extremely high proportion (10%) of mislabeled genes, 
its precision remains close to 100%. As a comparison, 
the precision of the conventional BL AST-based approach 
D > C drops below 15% (Additional file 1: Figure S8A). 
Gene loss and rate variation in the close group also 
have remarkably weaker deleterious effects on preci- 
sion of HGTector than that of conventional BLAST 
based methods under the D > C scenario (Additional 
file 1: Figure S8B, C). Rate variations in input genomes 
destabilize prediction results at a certain range (5-15%), 
but overall, precision remains high in HGTector when 
rate increases, as compared to conventional BLAST 
(Additional file 1: Figure S8D). Taxon sampling bias 
has little effect when it happens to distal organisms. 
However, in some close organisms (such as R. prowazekii 
and R. bellii), this effect may severely compromise the 
predicting power if the bias is strong (for example, a 
certain taxon group has many more representatives than 
others) (Additional file 1: Figure S9). Smaller sample 
size of input genes has a minor effect on the prediction 
result (Additional file 1: Figure S10), suggesting that the 
method is still effective on incomplete genomes. It should 
be noted that when the sample size is extremely small 
(e.g., 50 genes only), the statistical power of gene cluster- 
ing can be significantly impaired (green bars in Additional 
file 1: Figure S10). 

The above results suggest that our method is generally 
unaffected by stochastic events, some of which are major 
challenges to current methods (see Background). A con- 
cern is taxon sampling bias in the close group, an issue 
that can be alleviated by properly defining grouping 



Table 2 Summary of genes predicted to be horizontally acquired in seven Rickettsia genomes 


Abbreviation 


Size of chromosome (Mb) 


Number of chromosomal 
protein-coding genes 


Number of predicted 
HGT-derived genes 


Percentage of predicted 
HGT-derived genes 


R. akari 


1.23 


1256 


76 


6.05% 


R. felis 


1.49 


1400 


256 


18.29% 


R. massiliae 


1.36 


968 


93 


9.61% 


R. slovaca 


1.28 


1114 


72 


6.46% 


R. rickettsii 


1.26 


1342 


98 


7.30% 


R. africae 


1.28 


1030 


78 


7.57% 


R. conorii 


1.27 


1374 


127 


9.24% 


Mean 


1.31 


1212 


114.3 


9.43% 
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scenario and masking redundant taxonomic groups 
using HGTector s flexible functionality (see Methods - 
Computational pipeline). 

Comparison of HGTector with other methods for analysis of 
the R. felis genome 

Together, HGTector and six other methods identified 
595 putative HGT-derived genes, out of all 1400 chromo- 
somal protein-coding genes in the R. felis genome (42.5%) 
(Table 2, Figure 5, Additional file 2: Table S3). Of these 



595 genes, 82 have been uniquely identified by HGTector. 
There's a considerable portion of overlapping hits between 
each pair of results, but none of the programs produce 
completely identical HGT predictions. As expected, there 
is relatively more overlap between two methods of the 
same category. Meanwhile, the overlap factors (OFs) be- 
tween two methods from different categories are signifi- 
cantly lower. However, HGTector is a notable exception 
because its overlap with any other method of the three 
categories is significantly higher than between other 



1,485, 148 ( 



50,Ooo 
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Figure 5 Predicted HGTs by multiple methods mapped onto the Rickettsia felis genome. A circular view of the whole chromosome of 
R. felis, with genomic islands (GIs) predicted by IslandViewer and GIST indicated by boxes, and putative HGT-derived genes predicted by other 
methods indicated by arrowheads. 
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methods from different categories (Additional file 2: 
Table S4 and Additional file 1: Figure Sll). 

By comparing the results of HGTector and the results 
of the other two BLAST-based methods, it is noticeable 
that HGTector can effectively exclude false positives caused 
by database errors. For instance, BBH and DarkHorse 
respectively predicted 61 and 97 genes to be acquired 
from Ixodes scapular is (deer tick), which is a common 
host of Rickettsia. Therefore, it is likely that the sequenced 
Ixodes samples contained Rickettsia endosymbionts, and 
its DNA was also sequenced and indiscriminately labeled 
as Ixodes DNA in the unassembled shotgun sequences 
( [NCBLPRJNA34667] ) . In other words, these are potential 
database errors. In contrast, in HGTector s result, none of 
these genes were predicted as HGT-derived, because they 
have considerable amount of close hits (violation of rule 1), 
despite the presence of a single high-score Ixodes hit. 

Detection of HGT is fraught with challenges, such as 
ambiguity in compositional features and phyletic pat- 
terns [80], difficulty in phylogenetic reconstruction [81], 
interference from database error and incompleteness 
[49]. The observed low consistency between methods is 
not surprising, and has precedents in multiple previous 
studies [13-15]. Given the significant overlap of HGTector 
with all other previously employed approaches, we suggest 
that it is effective in producing meaningful prediction 
results on real datasets. 

Conclusions 

In this paper a novel method for genome-wide detection 
of vertical versus non-vertical gene history (in particular, 
putative HGT events) is presented. It features a statistical 
analysis of BLAST hit distribution patterns in the context 
of a priori defined phylogenetic hierarchies. The innovation 
of this method is the systematic consideration of all BLAST 
hits of all genes within selected genomes. This is in contrast 
to conventional BLAST-based approaches, which typically 
rely on a single best hit for each gene (see Background). 
The three-category grouping scenario is a simplified but 
effective implementation of prior phylogenetic knowledge 
into a BLAST hit distribution analysis. The set-up allows 
high flexibility in group assignments that best match the 
taxonomic level of the user s interest, as well as immediate 
response to frequent changes in microbial taxonomy. The 
advantage of this systematic approach is that it captures the 
overall image of gene evolution while being significantly 
insensitive to stochastic events. As demonstrated in this 
study, stochastic events such as gene loss, rate variation 
and database error may impose serious problem to con- 
ventional methods (see Background), but have compara- 
tively negligible effects on HGTector. 

The core assumption of the method is that the typical 
and atypical parts of the BLAST hit distribution are dis- 
tinguishable. This assumption is repeatedly supported by 



tests on both simulated and real-world genomic data (Fig- 
ures 2 and 3, Additional file 1: Figure S4). With the pro- 
posed procedures of computing cutoffs and the rules of 
targeting genes that are likely to be HGT-derived, 
remarkable prediction success was achieved (Figures 4 
and 5, Additional file 2: Table S3). Given these results, we 
suggest that HGTector is a useful addition to conventional 
BLAST-based approaches. 

The HGTector pipeline has advantages of speed, automa- 
tion, compatibility and low requirement for computational 
resources, making this program a generally applicable tool 
for discovery of vertical, and non-vertical history of genes, 
as well as initial HGT prediction. It is especially suitable 
for gaining a rapid and comprehensive overview of newly 
sequenced genomes to identify their evolutionary and eco- 
logical linkages with other organisms, facilitating further 
exploration of the functional drivers of the dynamics of 
genome evolution. It has to be made clear that an atypical 
BLAST hit distribution is an empirical observation, rather 
than a strict certification of HGT. Since an HGT predom- 
inantly reflects a past evolutionary event, it is theoretically 
impossible to identify exact gene donors and mechanisms, 
and any analysis is only an approximation to possible 
scenarios. Therefore, we recommend HGTector as a 
discovery tool for a detection of potential HGT-derived 
genes that can be further analyzed with phylogenetic 
approaches. This is much more effective approach than 
a very time-consuming and technically challenging 
process of a priori phylogenetic analysis of all genes 
within all target genomes, which becomes decreasingly 
feasible as more genomic data are present. 
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hit distribution and possible explanations. Figure S2. Precision-recall plot of 
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rate and performance of methods. Figure S4. Fingerprints of genomes of 
various organisms. Figure S5. Distribution of BLAST hit weights of the R. felis 
genome. Figure S6. Heat map indicating percentages of predicted 
HGT-derived genes by putative bacterial donor groups in Rickettsia 
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predicted HGT-derived genes by functional annotations in Rickettsia 
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various simulated stochastic events. Figure S9. Stability of results on 
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